Live HTML¶

Advanced Graphics and Data Visualization in R¶

Lecture 06: Trees, Network Diagrams and Genomic Representations¶

0.1.0 An overview of Advanced Graphics and Data Visualization in R¶

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the sixth and final lecture in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives¶

This week will focus on commonly used visualizations related to genomic information. When working with sequencing data, you may commonly wish to compare sequences based on their relationships or relative similarity (trees), by their sequence identity in gene families or potential interactions (graphs), and or more directly their sequence motifs.

At the end of this lecture you will have covered the following topics

  1. Phylogenetic Trees
  2. Network graphs
  3. Genome sequences and markers

0.3.0 A legend for text format in Jupyter markdown¶

grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson¶

Today's datasets will focus on SARS-CoV-2 variant surveillance data from the which has been tracking published sequenced genomes for the appearance of new strains in North America.

0.4.1 Dataset 1: nextstrain_ncov_north-america_canada_timetree.nwk¶

This is a Newick format data set describing a phylogenetic tree of SARS-CoV-2 strain information.

0.4.2 Dataset 2: nextstrain_ncov_north-america_canada_metadata.tsv¶

Metadata that accompanies the first dataset. It links strain information back to as much geographical and related information as possible.


0.5.0 Packages used in this lesson¶

repr- a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

RColorBrewer has some hlepful palettes that we'll need to colour our data.

ggnewscale will be helpful in generating multiple colour palettes across increasingly complex plots.

treeio, tidytree, and ggtree will be used to help import, parse and plot phylogenetic trees.

tidygraph and ggraph will be used to generate network graph objects and plot them.

ggseqlogo is used for generating sequence logos related to sequence motifs.

qqman is a wrapper package for generating Manhattan plots.

lubridate and zoo help us to work with some date-based information.

In [ ]:
# When do we start installing the packages
Sys.time()

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# New packages we haven't worked with before
BiocManager::install("ggnewscale")
install.packages("tidytree")

# 230413: May need to do this to get an older version of ggtree to work
install.packages("https://cran.r-project.org/src/contrib/Archive/rvcheck/rvcheck_0.1.3.tar.gz", repos = NULL)
# Install treeio before installing ggtree
BiocManager::install("treeio")

library(remotes)
install_bioc("ggtree", upgrade = "never") # Do not try to update treeio. It won't work on Jupyter Hub

install_bioc("ggtreeExtra", upgrade = "never")

# BiocManager::install("ggtreeExtra", force = TRUE)
install.packages("ggseqlogo")
install.packages("qqman")

# When do we finish?
Sys.time()
In [1]:
# Packages to help tidy our data
library(tidyverse)
library(readxl)

# Packages for the graphical analysis section
library(repr)
library(viridis)
library(RColorBrewer)
library(ggnewscale)

# Packages for today's lecture about trees
library(tidytree)
library(ggtree)
library(ggtreeExtra)
library(treeio)

# Packages for graphs
library(ggraph)
library(tidygraph)

# Packages for sequence logos
library(ggseqlogo)

# Packages for Manhattan plots
library(qqman)

# Date calculation helpers
library(lubridate)
library(zoo)
-- Attaching core tidyverse packages -------------------------------------------------------------------------------------------------------------------------------------- tidyverse 2.0.0 --
v dplyr     1.1.1     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts -------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: viridisLite

If you use the ggtree package suite in published research, please
cite the appropriate paper(s):

Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
for mapping and visualizing associated data on phylogeny using ggtree.
Molecular Biology and Evolution. 2018, 35(12):3041-3043.
doi:10.1093/molbev/msy194

Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96



Attaching package: 'tidytree'


The following object is masked from 'package:stats':

    filter


ggtree v3.6.2 For help: https://yulab-smu.top/treedata-book/

If you use the ggtree package suite in published research, please cite
the appropriate paper(s):

Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
ggtree: an R package for visualization and annotation of phylogenetic
trees with their covariates and other associated data. Methods in
Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628

LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
for phylogenetic tree input and output with richly annotated and
associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
doi: 10.1093/molbev/msz240

Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
object for visualization of a phylogenetic tree and annotation data.
iMeta 2022, 1(4):e56. doi:10.1002/imt2.56 


Attaching package: 'ggtree'


The following object is masked from 'package:tidyr':

    expand


ggtreeExtra v1.0.4  For help: https://yulab-smu.top/treedata-book/

If you use ggtreeExtra in published research, please cite the paper:

SB Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L Zhan, T Wu, E Hu, G Yu. ggtreeExtra: Compact visualization of richly annotated phylogenetic data. Research Square. doi: 10.21203/rs.3.rs-155672/v2, (preprint).



treeio v1.14.4  For help: https://yulab-smu.top/treedata-book/

If you use treeio in published research, please cite:

LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution 2020, 37(2):599-603. doi: 10.1093/molbev/msz240



Attaching package: 'tidygraph'


The following object is masked from 'package:stats':

    filter




For example usage please run: vignette('qqman')



Citation appreciated but not required:

Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.




Attaching package: 'zoo'


The following objects are masked from 'package:base':

    as.Date, as.Date.numeric



1.0.0 What's in a name? A phylogenetic tree is just a rebranded dendrogram¶

Dendrograms, cladograms, and phylogenetic trees all share a similar structure with some modifications. All are represented in branching tree structures that are used to represent the relationships among the leaves, also known as tips. Branches along the tree may also be referred to as edges.

Leaves can represent different species, strains, or multi-dimensional values. Leaves are connected by branches to their nearest neighbours or relatives. As you move backwards along the tree, you encounter internal nodes which can connect directly to more tips or more nodes. The distance between tips and nodes represent a relative distance that may be defined as some type of evolutionary distance, time, or simple euclidean distance. In a cladogram, however, distances along branches have no meaning and only define the presence of a relationship.

In tree terminology, it is helpful to define a few more terms:

Term Description
Root A tree is rooted when it defines a common ancester for all tips in the tree. This could be considered the start or entry point for the tree
Most recent common ancester (MRCA) This is an internal node that collectively joins two or more tips.
Clade A group of taxa (tips) that includes a common ancestor and all of its descendents.
Scaling This indicates that the branch lengths do represent a distance metric, whereas, an unscaled tree may have even-length branches not representative of evolutionary/relationship distances.
Diagram of phylogenetic tree terms from 10 differences between cladograms and phylogenetics trees

1.1.0 Where does tree data come from?¶

As we saw last week, clustering analysis such as that from hclust() can create dendrogram data. We plotted this in a couple of ways by itself using fviz_dend() from the factoextra package and with heatmap.2() from the gplots package. In the case of our first foray, we were looking at the "relationships" between samples by indicating how similar they were based on the characteristics in their various features.

Depending on the software used, trees can be represented in a number of formats, some of which are described below.

Format Description
Newick The standard used to represent trees in a computer-readable format. Trees are encoded in a parentheses-closure format where each tip takes the form of "taxa:branch-length"
and pairs of tips are separated by a comma , and enclosed by parentheses (). Internal nodes branch lengths are defined outside the parentheses with "():branch-length"
NEXUS/phylip Incorporates the Newick tree with separate related data that is partitioned into different blocks. Blocks are started with "BEGIN \<BLOCK NAME>;" and closed with "END;"
NHX New Hampshire eXtended format is also based on Newick format but instead of code blocks uses a tagging system for each node extending the format to
"taxa:branch-length[&&NHX:<tag information>]"

1.2.0 Using read.tree() from the treeio package¶

When opening tree files, depending on the format, you can use the treeio package and one of it's many parsing functions but the simplest way to avoid figuring out how to open a tree file, is to just use the read.tree() function. This will determing the appropriate file type and parse through the many different tree formats before depositing the data into a phylo object.

Today we'll be working with a dataset from the Auspice COVID-19 North American dataset maintained by Finlay Maguire. Unfortunately this dataset is no longer available but we have an older copy with a fair amount of metadata.

Let's begin by opening up the tree data and some related metadata before taking a look under the hood.

In [2]:
# Import our Newick Tree
SC2_variants_time.phylo <- read.tree("data/nextstrain_ncov_north-america_canada_timetree.nwk")

# Import our metadata
SC2_metadata.df <- read_tsv("data/nextstrain_ncov_north-america_canada_metadata.tsv")
Rows: 6627 Columns: 24
-- Column specification ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delimiter: "\t"
chr  (21): Strain, GISAID clade, Nextstrain clade, Clade, Country, Admin Div...
dbl   (1): Age
lgl   (1): url
date  (1): Collection Data

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
In [3]:
# What does the tree look like?
str(SC2_variants_time.phylo)

SC2_variants_time.phylo
List of 6
 $ edge       : int [1:12776, 1:2] 6628 6628 6629 6630 6631 6630 6632 6632 6633 6629 ...
 $ edge.length: num [1:12776] 0.0757 0.0848 0.0336 0.0501 0 ...
 $ Nnode      : int 6150
 $ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ...
 $ tip.label  : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
 $ root.edge  : num 2020
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"
Phylogenetic tree with 6627 tips and 6150 internal nodes.

Tip labels:
  Wuhan/WH01/2019, USA/MA1/2020, Canada/ON_ON-VIDO-01-2/2020, Canada/ON_VIDO-01/2020, USA/CA-CDC-5/2020, Taiwan/CGMH-CGU-02/2020, ...
Node labels:
  NODE_0008606, NODE_0000001, NODE_0000061, USA/MA1/2020_travel_history, NODE_0000062, Canada/ON_VIDO-01/2020_travel_history, ...

Rooted; includes branch lengths.

1.2.1 The phylo object is a simple list¶

From the looks of our structure, after reading in our Newick file, we have a list with six elements. They are pretty clearly named with edge information (a matrix of paired numbers), edge lengths, a count of the number of nodes (ie internal points where branches bifurcate. While mostly just numbers, some of these node.label values appear to be based on travel history information. All of the 6,627 tip.label values correspond to 6,627 SARS-CoV-2 strain names found in the metadata file as well.


1.2.2 Convert the tree object to a tibble for adding external information¶

Suppose we want to combine some information from our metadata with our tree? We can then use this information to colour, shape, and bring more visual order to our tree. Working with the tree, however, can be a little hard given that we have lists of different lengths representing different portions of the tree.

In our case, we only have tree tip information that we want to add to and the easiest way to work would be if it was in a table format. The package tidytree provides a function as_tibble that will parse and convert the phylo format to a tbl_tree object, which is a kind of tidy data frame.

Let's convert our variant phylo object to something we can work with.

In [4]:
# Convert the phylo object using as_tibble
SC2_variants_time.tb <- as_tibble(SC2_variants_time.phylo)

# Check out the structure now
str(SC2_variants_time.tb)

head(SC2_variants_time.tb)
tbl_tree [12,777 x 4] (S3: tbl_tree/tbl_df/tbl/data.frame)
 $ parent       : int [1:12777] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ...
 $ node         : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ...
 $ branch.length: num [1:12777] 0.0757 0 0.0077 0 0 ...
 $ label        : chr [1:12777] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
A tbl_tree: 6 × 4
parentnodebranch.lengthlabel
<int><int><dbl><chr>
662810.075652198Wuhan/WH01/2019
663120.000000000USA/MA1/2020
663230.007696891Canada/ON_ON-VIDO-01-2/2020
663340.000000000Canada/ON_VIDO-01/2020
663650.000000000USA/CA-CDC-5/2020
663560.044204648Taiwan/CGMH-CGU-02/2020

1.3.0 Updating our tree with external information¶

As you can see our SC2_variants_time.tb object is a simple data frame now represented as a 4-column table with 12,777 rows. The edge data is encoded between the parent and node columns with branch.length to define the length of each edge. We also have all of the tip and node names stored under the label column. With the tree in this format, we can now treat the tree like a tibble and join additional information to the tree.

Rules to keep in mind:

  1. Don't lose any node information from your original tree structure. Losing "observations" is essentially losing edges of your tree!
  2. Try to work with a complete dataset of information although sometimes it doesn't make sense that you would have it all.
  3. Convert your tibble back to a tree format when you're done with as.treedata()

You may have information about the leaves or tips of your tree but by default there is no real internal node information when it comes to sample origins or lineage. When joining information to the tables, use the correct direction *_join() to avoid data loss. A full_join() is probably the safest.

Let's take a quick look at our metadata and identify what we're interested in.

In [5]:
# Look at the metadata, which bits of information would be nice to add?
str(SC2_metadata.df, give.attr = FALSE)
spc_tbl_ [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Strain                    : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
 $ GISAID clade              : chr [1:6627] "L" "L" "L" "L" ...
 $ Nextstrain clade          : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Age                       : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ...
 $ Clade                     : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Country                   : chr [1:6627] "Asia" "USA" "Canada" "Canada" ...
 $ Admin Division            : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ...
 $ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ...
 $ gisaid_epi_isl            : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ...
 $ Host                      : chr [1:6627] "Human" "Human" "Human" "Human" ...
 $ Originating Lab           : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ...
 $ PANGO lineage             : chr [1:6627] "B" "B" "B" "B" ...
 $ Submission Date           : chr [1:6627] "Older" "Older" "Older" "Older" ...
 $ Region                    : chr [1:6627] "Asia" "North America" "North America" "North America" ...
 $ Region of exposure        : chr [1:6627] "Asia" "Asia" "North America" "Asia" ...
 $ Sex                       : chr [1:6627] "Male" "Male" "Male" "Male" ...
 $ strain                    : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
 $ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Submitting Lab            : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ...
 $ url                       : logi [1:6627] NA NA NA NA NA NA ...
 $ Collection Data           : Date[1:6627], format: "2019-12-26" "2020-01-29" ...
 $ Author                    : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ...
 $ Country of exposure       : chr [1:6627] NA "Asia" NA "Asia" ...
 $ Location                  : chr [1:6627] NA "Boston" NA NA ...
In [6]:
# Check out some of the values for these variables

# Metadata information on the samples
unique(SC2_metadata.df$Country)

unique(SC2_metadata.df$Age)

# Metadata information on the genomes
unique(SC2_metadata.df$'GISAID clade')

unique(SC2_metadata.df$'Nextstrain clade')

unique(SC2_metadata.df$Clade)

unique(SC2_metadata.df$'PANGO lineage')
  1. 'Asia'
  2. 'USA'
  3. 'Canada'
  4. 'Europe'
  5. 'Guatemala'
  6. 'Panama'
  7. 'Africa'
  8. 'Dominican Republic'
  9. 'Costa Rica'
  10. 'South America'
  11. 'Oceania'
  12. 'Jamaica'
  13. 'Mexico'
  14. 'Barbados'
  15. 'Guadeloupe'
  16. 'Saint Martin'
  17. 'Bermuda'
  18. 'Sint Maarten'
  19. 'Saint Barthélemy'
  20. 'Belize'
  21. 'Saint Lucia'
  22. 'El Salvador'
  23. 'Cuba'
  1. 44
  2. 21
  3. <NA>
  4. 54
  5. 47
  6. 23
  7. 55
  8. 62
  9. 49
  10. 50
  11. 43
  12. 25
  13. 26
  14. 33
  15. 38
  16. 9
  17. 73
  18. 64
  19. 57
  20. 58
  21. 0
  22. 28
  23. 70
  24. 71
  25. 20
  26. 93
  27. 36
  28. 37
  29. 40
  30. 56
  31. 89
  32. 45
  33. 34
  34. 53
  35. 94
  36. 95
  37. 86
  38. 84
  39. 60
  40. 90
  41. 27
  42. 30
  43. 59
  44. 41
  45. 46
  46. 52
  47. 51
  48. 32
  49. 24
  50. 35
  51. 29
  52. 61
  53. 79
  54. 66
  55. 78
  56. 11
  57. 2
  58. 65
  59. 75
  60. 31
  61. 48
  62. 74
  63. 68
  64. 67
  65. 63
  66. 77
  67. 42
  68. 22
  69. 12
  70. 5
  71. 17
  72. 80
  73. 16
  74. 39
  75. 82
  76. 8
  77. 7
  78. 88
  79. 76
  80. 69
  81. 19
  82. 87
  83. 3
  84. 14
  85. 18
  86. 1
  87. 81
  88. 15
  89. 6
  90. 10
  91. 85
  92. 4
  93. 98
  94. 91
  95. 72
  96. 99
  97. 13
  98. 92
  99. 83
  100. 96
  101. 104
  1. 'L'
  2. 'S'
  3. 'O'
  4. 'V'
  5. 'GR'
  6. 'G'
  7. 'GV'
  8. 'GH'
  9. 'GRY'
  10. NA
  1. '19A'
  2. '19B'
  3. NA
  4. '20A'
  5. '20B'
  6. '20E (EU1)'
  7. '20A.EU2'
  8. '20C'
  9. '20G'
  10. '20H/501Y.V2'
  11. '20D'
  12. '20J/501Y.V3'
  13. '20F'
  14. '20I/501Y.V1'
  1. '19A'
  2. '19B'
  3. '20A'
  4. '20E (EU1)'
  5. '20C'
  6. '20G'
  7. '20H/501Y.V2'
  8. '20B'
  9. '20D'
  10. '20J/501Y.V3'
  11. '20F'
  12. '20I/501Y.V1'
  1. 'B'
  2. 'A'
  3. 'A.6'
  4. 'A.2'
  5. 'A.2.4'
  6. 'A.2.5'
  7. 'A.5'
  8. 'A.1'
  9. 'A.21'
  10. 'A.18'
  11. 'A.12'
  12. 'A.28'
  13. 'A.19'
  14. 'A.25'
  15. 'A.23'
  16. 'A.23.1'
  17. 'B.12'
  18. 'B.56'
  19. 'B.4.7'
  20. 'B.31'
  21. 'B.61'
  22. 'B.40'
  23. 'B.35'
  24. 'B.28'
  25. 'B.11'
  26. 'B.55'
  27. 'B.27'
  28. 'B.3'
  29. 'B.4'
  30. 'B.4.8'
  31. 'B.4.2'
  32. 'B.53'
  33. 'B.6'
  34. 'B.1'
  35. 'B.6.6'
  36. 'B.6.1'
  37. 'B.6.2'
  38. 'B.6.8'
  39. 'B.1.256'
  40. 'B.1.1'
  41. 'B.1.1.420'
  42. 'B.1.240'
  43. 'B.1.205'
  44. 'B.1.128'
  45. 'B.1.221'
  46. 'B.1.223'
  47. 'B.1.214.2'
  48. 'B.1.214'
  49. 'B.1.146'
  50. 'B.1.147'
  51. 'B.1.104'
  52. 'B.1.391'
  53. 'B.1.380'
  54. 'B.1.203'
  55. 'B.1.540'
  56. 'B.1.416'
  57. 'B.1.600'
  58. 'B.1.160'
  59. 'B.1.195'
  60. 'B.1.609'
  61. 'B.1.561'
  62. 'B.1.232'
  63. 'B.1.558'
  64. 'B.1.239'
  65. 'B.1.237'
  66. 'B.1.597'
  67. 'B.1.480'
  68. 'B.1.397'
  69. 'B.1.408'
  70. 'B.1.243'
  71. 'B.1.240.2'
  72. 'B.1.240.1'
  73. 'B.1.524'
  74. 'B.1.179'
  75. 'B.1.1.298'
  76. 'B.1.400'
  77. 'B.1.398'
  78. 'B.1.236'
  79. 'B.1.234'
  80. 'B.1.459'
  81. 'B.1.131'
  82. 'B.1.525'
  83. 'B.1.563'
  84. 'B.1.411'
  85. 'B.1.379'
  86. 'B.1.177.31'
  87. 'B.1.177'
  88. 'B.1.177.73'
  89. 'B.1.177.11'
  90. 'B.1.177.14'
  91. 'B.1.177.15'
  92. 'AA.1'
  93. 'B.1.177.44'
  94. 'B.1.177.39'
  95. 'B.1.177.40'
  96. 'B.1.177.35'
  97. 'B.1.177.82'
  98. 'B.1.177.87'
  99. 'B.1.177.75'
  100. 'B.1.177.45'
  101. 'B.1.177.28'
  102. 'B.1.177.32'
  103. 'B.1.177.37'
  104. 'B.1.177.72'
  105. 'B.1.117.2'
  106. 'B.1.177.4'
  107. 'B.1.177.53'
  108. 'W.2'
  109. 'W.1'
  110. 'B.1.177.50'
  111. 'B.1.177.81'
  112. 'B.1.177.58'
  113. 'B.1.177.55'
  114. 'B.1.177.56'
  115. 'B.1.177.57'
  116. 'B.1.177.51'
  117. 'B.1.177.52'
  118. 'B.1.177.21'
  119. 'B.1.177.77'
  120. 'B.1.177.6'
  121. 'B.1.177.7'
  122. 'B.1.177.86'
  123. 'B.1.177.8'
  124. 'B.1.177.62'
  125. 'B.1.177.60'
  126. 'U.3'
  127. 'B.1.177.79'
  128. 'B.1.9.5'
  129. 'B.1.9'
  130. 'B.1.439'
  131. 'B.1.111'
  132. 'B.1.162'
  133. 'B.1.438'
  134. 'B.1.164'
  135. 'B.1.36'
  136. 'B.1.441'
  137. 'B.1.36.35'
  138. 'B.1.468'
  139. 'B.1.462'
  140. 'B.1.471'
  141. 'B.1.469'
  142. 'B.1.466'
  143. 'B.1.466.2'
  144. 'B.1.470'
  145. 'B.1.160.16'
  146. 'B.1.456'
  147. 'B.1.36.29'
  148. 'B.1.36.31'
  149. 'B.1.36.1'
  150. 'B.1.36.8'
  151. 'B.1.36.26'
  152. 'B.1.36.34'
  153. 'B.1.36.7'
  154. 'B.1.36.16'
  155. 'B.1.36.36'
  156. 'B.1.36.38'
  157. 'B.1.36.37'
  158. 'B.1.36.18'
  159. 'B.1.36.10'
  160. 'B.1.476'
  161. 'B.1.281'
  162. 'B.1.273'
  163. 'B.1.544'
  164. 'B.1.270'
  165. 'B.1.277'
  166. 'B.1.279'
  167. 'B.1.314'
  168. 'B.1.367'
  169. 'B.1.370'
  170. 'B.1.413'
  171. 'B.1.316'
  172. 'B.1.3'
  173. 'B.1.509'
  174. 'B.1.504'
  175. 'B.1.362'
  176. 'B.1.499'
  177. 'B.1.615'
  178. 'B.1.611'
  179. 'B.1.505'
  180. 'B.1.422'
  181. 'B.1.596.1'
  182. 'B.1.362.2'
  183. 'B.1.369'
  184. 'B.1.564'
  185. 'B.1.564.1'
  186. 'B.1.369.1'
  187. 'B.1.1.1'
  188. 'B.1.428'
  189. 'B.1.356'
  190. 'B.1.601'
  191. 'B.1.562'
  192. 'B.1.360'
  193. 'B.1.311'
  194. 'B.1.598'
  195. 'B.1.350'
  196. 'B.1.426'
  197. 'B.1.265'
  198. 'B.1.336'
  199. 'B.1.589'
  200. 'B.1.567'
  201. ...
  202. 'B.1.178'
  203. 'B.1.220'
  204. 'B.1.6'
  205. 'B.1.84'
  206. 'B.1.1.170'
  207. 'B.1.93'
  208. 'B.1.91'
  209. 'B.1.219'
  210. 'B.1.211'
  211. 'B.1.388'
  212. 'B.1.210'
  213. 'B.1.555'
  214. 'B.1.206'
  215. 'B.1.260'
  216. 'B.1.420'
  217. 'B.1.192'
  218. 'B.1.258'
  219. 'B.1.258.17'
  220. 'B.1.242'
  221. 'B.1.126'
  222. 'B.1.378'
  223. 'B.1.565'
  224. 'B.1.139'
  225. 'B.1.8'
  226. 'B.1.78'
  227. 'B.1.396'
  228. 'B.1.560'
  229. 'B.1.527'
  230. 'B.1.142'
  231. 'B.1.393'
  232. 'B.1.187'
  233. 'B.1.149'
  234. 'B.1.549'
  235. 'B.1.545'
  236. 'B.1.189'
  237. 'B.1.528'
  238. 'B.1.23'
  239. 'B.1.67'
  240. 'B.1.547'
  241. 'P.2'
  242. 'B.1.1.28'
  243. 'B.1.390'
  244. 'B.1.395'
  245. 'B.1.530'
  246. 'C.35'
  247. 'B.1.233'
  248. 'B.1.543'
  249. 'B.6.7'
  250. 'B.1.1.369'
  251. 'B.1.1.372'
  252. 'C.8'
  253. 'C.36'
  254. 'C.36.1'
  255. 'C.16'
  256. 'C.23'
  257. 'C.1.1'
  258. 'B.1.1.375'
  259. 'C.2.1'
  260. 'C.12'
  261. 'B.1.1.161'
  262. 'B.1.1.87'
  263. 'B.1.1.192'
  264. 'B.1.1.71'
  265. 'B.1.1.10'
  266. 'L.3'
  267. 'L.1'
  268. 'B.1.1.176'
  269. 'B.1.1.398'
  270. 'B.1.1.417'
  271. 'B.1.1.63'
  272. 'B.1.1.422'
  273. 'B.1.1.487'
  274. 'B.1.1.312'
  275. 'B.1.1.331'
  276. 'B.1.1.58'
  277. 'B.1.1.277'
  278. 'K.1'
  279. 'B.1.1.30'
  280. 'B.1.1.516'
  281. 'B.1.1.50'
  282. 'B.1.1.33'
  283. 'N.2'
  284. 'N.3'
  285. 'N.7'
  286. 'N.5'
  287. 'N.8'
  288. 'N.4'
  289. 'B.1.1.409'
  290. 'B.1.1.61'
  291. 'B.1.1.220'
  292. 'R.1'
  293. 'B.1.1.317'
  294. 'B.1.1.464.1'
  295. 'B.1.1.316'
  296. 'B.1.1.254'
  297. 'B.1.1.267'
  298. 'B.1.1.241'
  299. 'B.1.1.25'
  300. 'B.1.1.181'
  301. 'B.1.1.434'
  302. 'B.1.1.243'
  303. 'B.1.1.159'
  304. 'B.1.1.411'
  305. 'B.1.1.232'
  306. 'B.1.1.459'
  307. 'B.1.1.111'
  308. 'B.1.1.294'
  309. 'B.1.1.373'
  310. 'B.1.1.413'
  311. 'B.1.1.354'
  312. 'B.1.1.54'
  313. 'B.1.1.325'
  314. 'B.1.1.399'
  315. 'B.1.1.46'
  316. 'B.1.1.274'
  317. 'B.1.1.410'
  318. 'B.1.1.315'
  319. 'AD.2'
  320. 'B.1.1.142'
  321. 'B.1.1.485'
  322. 'B.1.1.200'
  323. 'B.1.1.339'
  324. 'AL.1'
  325. 'B.1.1.163'
  326. 'B.1.1.378'
  327. 'B.1.1.39'
  328. 'B.1.1.440'
  329. 'B.1.1.297'
  330. 'B.1.1.397'
  331. 'B.1.1.207'
  332. 'B.1.1.391'
  333. 'B.1.1.348'
  334. 'B.1.1.121'
  335. 'B.1.1.512'
  336. 'B.1.1.136'
  337. 'B.1.1.288'
  338. 'B.1.1.368'
  339. 'B.1.1.328'
  340. 'B.1.1.324'
  341. 'B.1.1.360'
  342. 'B.1.1.359'
  343. 'B.1.1.374'
  344. 'B.1.1.12'
  345. 'B.1.1.407'
  346. 'B.1.1.500'
  347. 'B.1.1.432'
  348. 'B.1.301'
  349. 'B.1.1.37'
  350. 'B.1.1.396'
  351. 'B.1.1.330'
  352. 'B.1.1.280'
  353. 'B.1.1.141'
  354. 'B.1.1.273'
  355. 'B.1.1.514'
  356. 'B.1.1.311'
  357. 'B.1.1.377'
  358. 'B.1.1.351'
  359. 'B.1.1.350'
  360. 'B.1.1.266'
  361. 'B.1.1.225'
  362. 'B.1.1.371'
  363. 'B.1.1.438'
  364. 'B.1.1.153'
  365. 'B.1.1.134'
  366. 'B.1.1.439'
  367. 'B.1.1.152'
  368. 'B.1.1.275'
  369. 'B.1.406'
  370. 'B.1.1.261'
  371. 'P.1'
  372. 'B.1.1.263'
  373. 'B.1.1.349'
  374. 'B.1.1.214'
  375. 'B.1.326'
  376. 'B.1.1.269'
  377. 'D.2'
  378. 'B.1.1.381'
  379. 'B.1.1.448'
  380. 'B.1.1.388'
  381. 'B.1.1.157'
  382. 'B.1.1.394'
  383. 'B.1.1.218'
  384. 'B.1.1.27'
  385. 'B.1.1.70'
  386. 'B.1.1.204'
  387. 'B.1.1.301'
  388. 'B.1.1.326'
  389. 'B.1.1.120'
  390. 'B.1.1.389'
  391. 'B.1.1.401'
  392. 'B.1.1.404'
  393. 'B.1.1.416'
  394. 'B.1.1.216'
  395. 'AM.1'
  396. 'AM.4'
  397. 'B.1.1.34'
  398. 'B.1.1.198'
  399. 'B.1.1.319'
  400. 'B.1.1.7'
  401. NA

1.3.1 Pick the tip attributes that are most interesting or relevant¶

For our purposes it looks like we will want to explore our data with:

  • GISAID clade: A global initiative on sharing avian influenza data scientific consortium of clade naming.
  • Nextstrain clade: Computationally labeled clade information based on the Nextstrain criteria and analysis of new and continuing strains from COVID genomic data.
  • Emerging Nextstrain clade
  • PANGO lineage: a dynamic nomenclature of SARS-CoV-2 lineages.
  • Country: The country where the sample was collected.
  • Admin Division: geographic location of the particular strain/case.
  • Age: the age of the patients from which the sample was collected (if included)
  • Collection Data: the date the data was published or collected.
  • Strain: the name of the strain which we'll need for merging with our tree structure.

Let's pull that information down and add it to our tbl_tree before converting it to a treedata object.

In [7]:
# Add phylogenetic information to our tree
SC2_variants_time.tree <-

# Pass along the metadata
SC2_metadata.df %>% 

# Select just a handful of attributes to go to the tree
select(Strain, 'GISAID clade', 'Emerging Nextstrain clade', 
       'PANGO lineage', Country, 'Admin Division', 'Collection Data', Age) %>% 

rename(strain = Strain,
       GISAID = 'GISAID clade', 
       emerging_nextstrain = 'Emerging Nextstrain clade',
       PANGO = 'PANGO lineage',
       strain_country = Country,
       strain_division = 'Admin Division', 
       strain_date = 'Collection Data') %>% 

# full_join to our tree to ensure no data is lost although you could also carefully use a left_join
full_join(x=SC2_variants_time.tb, y=., by=c("label" = "strain")) %>% 

# Convert to a tidytree format
as.treedata()

# Look at the resulting structure
str(SC2_variants_time.tree)
Formal class 'treedata' [package "tidytree"] with 11 slots
  ..@ file       : chr(0) 
  ..@ treetext   : chr(0) 
  ..@ phylo      :List of 5
  .. ..$ edge       : int [1:12776, 1:2] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr [1:2] "parent" "node"
  .. ..$ edge.length: num [1:12776] 0.0757 0 0.0077 0 0 ...
  .. ..$ tip.label  : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
  .. ..$ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ...
  .. ..$ Nnode      : int 6150
  .. ..- attr(*, "class")= chr "phylo"
  ..@ data       : tibble [12,777 x 8] (S3: tbl_df/tbl/data.frame)
  .. ..$ node               : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ GISAID             : chr [1:12777] "L" "L" "L" "L" ...
  .. ..$ emerging_nextstrain: chr [1:12777] "19A" "19A" "19A" "19A" ...
  .. ..$ PANGO              : chr [1:12777] "B" "B" "B" "B" ...
  .. ..$ strain_country     : chr [1:12777] "Asia" "USA" "Canada" "Canada" ...
  .. ..$ strain_division    : chr [1:12777] "Asia" "Massachusetts" "Ontario" "Ontario" ...
  .. ..$ strain_date        : Date[1:12777], format: "2019-12-26" "2020-01-29" ...
  .. ..$ Age                : num [1:12777] 44 21 NA NA 54 NA 47 47 23 55 ...
  ..@ extraInfo  : tibble [0 x 0] (S3: tbl_df/tbl/data.frame)
 Named list()
  ..@ tip_seq    : NULL
  ..@ anc_seq    : NULL
  ..@ seq_type   : chr(0) 
  ..@ tipseq_file: chr(0) 
  ..@ ancseq_file: chr(0) 
  ..@ info       : list()

1.3.1.1 How to access S4 object types¶

Take a quick look at the structure of this treedata object. What do we see? The treedata class has converted our tibble back into an S4 object with 11 slots. Each slot, is essentially a placeholder for another object. We can access these slots using the @ operator and we can further access sub elements with the $ operator.

For instance, we can see there is a phylo slot which looks suspiciously like our original SC2_variants_time.phylo object. The rest of our data frame information, which we just added to the tbl_tree before converting it is stored in the data slot. There are additional slots where we can add sequencing information for comparison in different types of phylogenetic tree visualizations.

In [8]:
# Grab our phylo object
SC2_variants_time.tree@phylo

# Take a peek at our associated node data
head(SC2_variants_time.tree@data)
Phylogenetic tree with 6627 tips and 6150 internal nodes.

Tip labels:
  Wuhan/WH01/2019, USA/MA1/2020, Canada/ON_ON-VIDO-01-2/2020, Canada/ON_VIDO-01/2020, USA/CA-CDC-5/2020, Taiwan/CGMH-CGU-02/2020, ...
Node labels:
  NODE_0008606, NODE_0000001, NODE_0000061, USA/MA1/2020_travel_history, NODE_0000062, Canada/ON_VIDO-01/2020_travel_history, ...

Unrooted; includes branch lengths.
A tibble: 6 × 8
nodeGISAIDemerging_nextstrainPANGOstrain_countrystrain_divisionstrain_dateAge
<int><chr><chr><chr><chr><chr><date><dbl>
1L19ABAsia Asia 2019-12-2644
2L19ABUSA Massachusetts2020-01-2921
3L19ABCanadaOntario 2020-01-23NA
4L19ABCanadaOntario 2020-01-23NA
5L19ABUSA California 2020-01-2954
6S19ABAsia Asia 2020-02-04NA

1.4.0 Plot your tree with ggtree() from the ggtree package¶

Now that we've put some extra data into our tree, we are ready to plot it with the help of the ggtree() function from the ggtree package. If you haven't noticed yet, treeio, tidytree and ggtree form a suite of packages that we can use to import, alter, and visualize our trees.

Parameters for ggtree() include:

  • tr: the phylo tree object
  • layout: the shape of the tree including: rectangular, dendrogram, slanted, ellipse, fan, circular, inward_circular, and radial.
  • mrsd: most recent sampling date used for setting a time-based scale
  • as.Date: logical to specify if date will be in calendar vs decimal-date format
  • branch.length: variable for scaling branch length
  • root.position: the position of our root node (default = 0)

We'll also use theme_tree2() which will automatically add an x-axis scale to our tree.

We'll start simple by just visualizing all 6,627 tips of our tree.

In [9]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)

# Show the entire tree

SC2_variants_time.tree %>% 

# 1. Data
ggtree(tr = .) + 
    # 2. Aesthetics
    aes(color = emerging_nextstrain) + 
    
    # Themes
    theme_tree2() + 
    theme(text = element_text(size = 30),
          legend.position = "bottom") + # Move our legend to the bottom

    # Provide some labels
    labs(x="Time",
         title = "Canada and US sequenced strain phylogeny") +

    # Make our guide lines a little thicker 
    guides(colour = guide_legend(title = "Nextstrain\nclade",  
                                          override.aes = list(size=2)))
! The tree contained negative edge lengths. If you want to ignore the edges, you can set `options(ignore.negative.edge=TRUE)`, then re-run ggtree.


1.5.0 Filter your tree using drop.tip()¶

Looking at our visualization, there are a lot of tips to display and we've coloured the tree along the branches. Another thing we can see is that there is an "NA" value. This mostly represents the branches of the tree connecting nodes internally since they do not have a specific lineage associated from our data.

As it stands, there's just way too much data here to look at and work with. We can trim that data using the drop.tip() function without worrying about how the tree is parsed. Internally, all that pruning will take place within the function. To do this, we'll generate a list of tips we want to remove first.

In [10]:
# You can make a list of tips to drop when you're making your ggtree

# Generate a table of tree information to drop
to_drop <-

SC2_metadata.df %>% 

# Filter for strains that are not from Canada and are from before 2020-11-01
filter(!(Country == "Canada" &
         `Collection Data` >= as.Date("2020-11-01")
        )) %>%

# Just keep the strain information
pull(Strain)

# Take a look at what we got back
str(to_drop)
 chr [1:6199] "Wuhan/WH01/2019" "USA/MA1/2020" ...
In [11]:
# At the same time make a dataset to keep. You'll use this to update the information in the tree later
to_keep <- SC2_metadata.df %>% filter(!Strain %in% to_drop)

# Which tips are we keeping?
str(to_keep, give.attr = FALSE)
spc_tbl_ [428 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Strain                    : chr [1:428] "Canada/NS-NML-16223/2021" "Canada/NB-NML-3294/2021" "Canada/NS-NML-16211/2021" "Canada/NB-NML-5584/2021" ...
 $ GISAID clade              : chr [1:428] "S" "S" "S" "S" ...
 $ Nextstrain clade          : chr [1:428] "19B" "19B" "19B" "19B" ...
 $ Age                       : num [1:428] 23 25 35 29 61 79 NA NA NA NA ...
 $ Clade                     : chr [1:428] "19B" "19B" "19B" "19B" ...
 $ Country                   : chr [1:428] "Canada" "Canada" "Canada" "Canada" ...
 $ Admin Division            : chr [1:428] "Nova Scotia" "New Brunswick" "Nova Scotia" "New Brunswick" ...
 $ Admin Division of exposure: chr [1:428] "Nova Scotia" "New Brunswick" "Nova Scotia" "New Brunswick" ...
 $ gisaid_epi_isl            : chr [1:428] "EPI_ISL_1055741" "EPI_ISL_961659" "EPI_ISL_1055730" "EPI_ISL_961643" ...
 $ Host                      : chr [1:428] "Human" "Human" "Human" "Human" ...
 $ Originating Lab           : chr [1:428] "QEII Health Sciences Centre" "Hôpital Georges L. Dumont" "QEII Health Sciences Centre" "Hôpital Georges L. Dumont" ...
 $ PANGO lineage             : chr [1:428] "A.2.5" "A.23.1" "A.23.1" "A.23.1" ...
 $ Submission Date           : chr [1:428] "Older" "Older" "Older" "Older" ...
 $ Region                    : chr [1:428] "North America" "North America" "North America" "North America" ...
 $ Region of exposure        : chr [1:428] "North America" "North America" "North America" "North America" ...
 $ Sex                       : chr [1:428] "Male" "Female" "Female" "Male" ...
 $ strain                    : chr [1:428] "Canada/NS-NML-16223/2021" "Canada/NB-NML-3294/2021" "Canada/NS-NML-16211/2021" "Canada/NB-NML-5584/2021" ...
 $ Emerging Nextstrain clade : chr [1:428] "19B" "19B" "19B" "19B" ...
 $ Submitting Lab            : chr [1:428] "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" ...
 $ url                       : logi [1:428] NA NA NA NA NA NA ...
 $ Collection Data           : Date[1:428], format: "2021-01-16" "2021-01-02" ...
 $ Author                    : chr [1:428] "Anna Majer et al" "Anna Majer et al" "Anna Majer et al" "Anna Majer et al" ...
 $ Country of exposure       : chr [1:428] NA NA NA NA ...
 $ Location                  : chr [1:428] NA NA NA NA ...

1.6.0 Update our pruned tree and add some extra geom_* layers¶

The ggtree package brings a number of ggplot2-compatible geoms to our finger tips. We'll spruce up our tree with some tip labels to begin with. We'll accomplish this with the geom_tiplab() layer.

Component type Layer type Command Description
Tree Geom geom_tree tree structure layer, with multiple layout supported
Tree Geom geom_treescale tree branch scale legend
Node Geom geom_nodepoint annotate internal nodes with symbolic points
Node Annotation geom_range bar layer to present uncertainty of evolutionary inference
Node Annotation geom_rootpoint annotate root node with symbolic point
Branch Annotation geom_label2 modified version of geom_label, with subsetting supported for labelling branches
Branch Annotation geom_segment2 modified version of geom_segment, with subsetting supported
Taxa Geom geom_point2 modified version of geom_point, with subsetting supported
Taxa Geom geom_tippoint annotate external nodes with symbolic points
Taxa Annotation geom_taxalink associate two related taxa by linking them with a curve
Taxa Annotation geom_text2 modified version of geom_text, with subsetting supported
Taxa Annotation geom_tiplab layer of tip labels
Taxa Annotation geom_tiplab2 layer of tip labels for circular layout
Clade Annotation geom_balance highlights the two direct descendant clades of an internal node
Clade Annotation geom_cladelabel annotate a clade with bar and text label
Clade Annotation geom_hilight highlight a clade with rectangle
Clade Annotation geom_strip annotate associated taxa with bar and (optional) text label
In [12]:
# Show our pruned tree

SC2_variants_time.tree %>%
# drop tips based on our list
drop.tip(., to_drop) %>% 

# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
    # 2. Aesthetics
    aes(color = emerging_nextstrain) + 
    
    # Themes
    theme_tree2() + 
    theme(text = element_text(size = 30),
          legend.position = "bottom") + # Move our legend to the bottom

    # Provide some labels
    labs(x="Time",
         title = "Subset of Canadian sequenced strains") +
    
    # Make our guide lines a little thicker 
    guides(colour = guide_legend(title = "Nextstrain\nclade",  
                                          override.aes = list(size=2))) +    
    
    # 4. Geoms
    ### 1.6.0 Add tips labels to our tree tips
    geom_tiplab()

1.6.1 Replace labels with points using different geoms¶

In a tree this packed, you can't really read the tip labels. The data that's most helpful, however, is the colouring. You can replace your labels with points in 3 ways:

  • geom_point()
  • geom_nodepoint()
  • geom_tippoint()

If we use geom_point() we can colour/annotate all of the nodes and tips but we already know that there isn't any phylogenetic information associated with our internal nodes. Likewise, geom_nodepoint() will allow us to colour the nodes but there isn't any grouping information associated with that.

Therefore we'll use geom_tippoint() to colour our tips based on their Nextstrain clade information and we can alter the tip shape to match the province where the strain was sequenced.

In [13]:
# Show entire tree

SC2_variants_time.tree %>%
drop.tip(., to_drop) %>% 

# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
    # 2. Aesthetics
    aes(color = emerging_nextstrain) + 
    
    # Themes
    theme_tree2() + 
    theme(text = element_text(size = 30),
          legend.position = "bottom", # Move our legend to the bottom
          legend.box = "vertical") +  # Stack legends vertically

    # Provide some labels
    labs(x="Time",
         title = "Subset of Canadian sequenced strains") +
    
    # Make our guide lines a little thicker 
    guides(colour = guide_legend(title = "Nextstrain\nclade",  
                                          override.aes = list(size=2)),
           shape = guide_legend(title = "Strain\nlocation")
          ) + 

    # 3. Scaling
    scale_shape_manual(values = c(15:20, 11)) +
    
    # 4. Geoms
    # Add tips labels to our tree tips
    ### 1.6.1 Change our tip shape by strain_division
    geom_tippoint(aes(shape = strain_division), size = 3)  # Add tips only

1.6.2 Fix your x-axis with the mrsd parameter¶

As you can see we are working with a time-based axis but it all appears to be on a relative scale and what we'd really like to see is real-world time. In order to do that, we can assign the mrsd parameter to the most recent sampling date in our dataset. We'll also set our parameter as.Date in order to see the date in a calendar format rather than a decimal-date format.

In [15]:
# Show entire tree

SC2_variants_time.tree %>%
drop.tip(., to_drop) %>% 

# 1. Data
ggtree(tr = ., 
       mrsd = "2021-02-17",    ### 1.6.2 Set the most recent sampling date
       as.Date = TRUE) +       ### 1.6.2 Make sure it's set as a data 
    # 2. Aesthetics
    aes(color = emerging_nextstrain) + 
    
    # Themes
    theme_tree2() + 
    theme(text = element_text(size = 30),
          legend.position = "bottom", # Move our legend to the bottom
          legend.box = "vertical") +  # Stack legends vertically

    # Provide some labels
    labs(x="Time",
         title = "Subset of Canadian sequenced strains") +
    
    # Make our guide lines a little thicker 
    guides(colour = guide_legend(title = "Nextstrain\nclade",  
                                          override.aes = list(size=2)),
           shape = guide_legend(title = "Strain\nlocation")
          ) + 

    # 3. Scaling
    scale_shape_manual(values = c(15:20, 11)) +

    # 4. Geoms
    # Add tips labels to our tree tips
    # Change our tip shape by strain_division
    geom_tippoint(aes(shape = strain_division), size = 3)  # Add tips only

1.7.0 Thin out the tree some more with viewClade() and other information functions¶

Looking at the tree there are still quite a few nodes but we can zoom in on a specific clade using the viewClade() function. To accomplish this we need to access the MRCA of a group. At the same time, we should talk about ways to traverse or navigate this tree. Here's where tidytree offers up a few additional functions for searching your tbl_tree. These take on the form of function(tbl_tree, node_number) or function(tbl_tree, node_label)

function Description
child() Find the children of an internal node.
parent() Find the parent node of a tip or other node.
offspring() Find all of the offspring of a node.
ancestor() Find all of the ancestors of a tip or node
sibling() Find the nearest sibling of a tip (or node)
MRCA() Find the most recent common ancestor between two or more nodes

Let's filter our taxa list a little more searching for variants of the clade 501Y.V1 and 501Y.V2. Then we'll find the MRCA between those tips.

In [16]:
# You can make a list of tips to drop when you're making your ggtree

# Generate a table of tree information to drop
to_view <-

SC2_metadata.df %>% 

    # Filter our tip information
    filter(`Emerging Nextstrain clade` %in% c("20I/501Y.V1", "20H/501Y.V2") & 
           Country == "Canada" &
           `Collection Data` >= as.Date("2021-02-16")
          ) %>% 

    # Just keep the strain names
    pull(Strain)

# What's the list look like?
to_view
  1. 'Canada/NL-NML-17170/2021'
  2. 'Canada/NL-NML-17182/2021'
  3. 'Canada/NL-NML-17173/2021'
  4. 'Canada/NL-NML-17193/2021'
  5. 'Canada/NL-NML-17157/2021'
  6. 'Canada/NL-NML-17185/2021'
  7. 'Canada/NL-NML-17189/2021'
  8. 'Canada/NL-NML-17196/2021'
  9. 'Canada/NL-NML-17184/2021'
  10. 'Canada/NL-NML-17166/2021'
  11. 'Canada/NL-NML-17191/2021'
  12. 'Canada/NL-NML-17172/2021'
  13. 'Canada/NL-NML-17187/2021'
  14. 'Canada/NL-NML-17168/2021'
  15. 'Canada/NL-NML-17194/2021'
  16. 'Canada/NL-NML-17158/2021'
  17. 'Canada/NL-NML-17183/2021'
  18. 'Canada/NL-NML-17165/2021'
In [17]:
# Calculate the MRCA from a subset of our tips

to_view_MRCA <-
# Pass along our tree information
SC2_variants_time.tree %>%

# Get rid of the tips we would before (to match the tree we'll plot)
drop.tip(., to_drop) %>%

# Determine the MRCA between the first 4 tips in the list
MRCA(., to_view[1:4])

# What kind of value does MRCA() return?
to_view_MRCA
727
In [18]:
# How many offspring does it have?

# Pass along our tree information
SC2_variants_time.tree %>%
    
    # Drop our original set of tips
    drop.tip(., to_drop) %>% 
    # make a tibble so we can see all the relevant information
    as_tibble() %>% 
    # Find the offspring of a specific node
    offspring(., to_view_MRCA) %>% 
    # Convert to a tree
    as.treedata() %>% 
    str()
Formal class 'treedata' [package "tidytree"] with 11 slots
  ..@ file       : chr(0) 
  ..@ treetext   : chr(0) 
  ..@ phylo      :List of 5
  .. ..$ edge       : int [1:71, 1:2] 728 730 731 732 732 734 735 735 736 737 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr [1:2] "parent" "node"
  .. ..$ edge.length: num [1:71] 0.1136 0.067 0.0521 0.0306 0.0333 ...
  .. ..$ tip.label  : chr [1:39] "Canada/NL-NML-16819/2021" "Canada/NL-NML-16824/2021" "Canada/NL-NML-17178/2021" "Canada/NL-NML-16829/2021" ...
  .. ..$ node.label : chr [1:32] "NODE_0006538" "NODE_0006539" "NODE_0006540" "NODE_0006541" ...
  .. ..$ Nnode      : int 32
  .. ..- attr(*, "class")= chr "phylo"
  ..@ data       : tibble [71 x 8] (S3: tbl_df/tbl/data.frame)
  .. ..$ node               : int [1:71] 323 324 325 326 327 328 329 330 331 332 ...
  .. ..$ GISAID             : chr [1:71] "GRY" "GRY" "GRY" "GRY" ...
  .. ..$ emerging_nextstrain: chr [1:71] "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" ...
  .. ..$ PANGO              : chr [1:71] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" ...
  .. ..$ strain_country     : chr [1:71] "Canada" "Canada" "Canada" "Canada" ...
  .. ..$ strain_division    : chr [1:71] "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" ...
  .. ..$ strain_date        : Date[1:71], format: "2021-02-10" "2021-02-12" ...
  .. ..$ Age                : num [1:71] 47 16 26 62 60 45 19 45 35 31 ...
  ..@ extraInfo  : tibble [0 x 0] (S3: tbl_df/tbl/data.frame)
 Named list()
  ..@ tip_seq    : NULL
  ..@ anc_seq    : NULL
  ..@ seq_type   : chr(0) 
  ..@ tipseq_file: chr(0) 
  ..@ ancseq_file: chr(0) 
  ..@ info       : list()

1.7.1 Pass a ggplot2 object to viewClade()¶

We see that our chosen MRCA node (727) has 39 children based on the number of tip labels in our treedata object. We can also now use this MRCA to determine the specific clade we can see on the tree but this will still be quite large. To view this particular clade, we build our plot as before, and then zoom into it afterwards with the viewClade() function.

Unfortunately, we'll have to ditch our date scale and revert to a decimal-date. If you're not using a calendar-based date, it's not a problem at all. If we look carefully at the data, we'll see that the internal nodes have an NA-value. If you were industrious enough, you could use the branch lengths to traverse the tree and calculate dates for all the internal nodes as well. This would likely solve the issue.

Let's drop the tip names back in there too.

In [19]:
# Show a small clade of the tree

tree.plot <-

SC2_variants_time.tree %>%
# From our tree, drop the Non-Canadian tips
drop.tip(., to_drop) %>% 

# 1. Data
ggtree(tr = ., mrsd = "2021-02-16") + # Need the most recent sampling date
    # 2. Aesthetics
    aes(color = emerging_nextstrain) + 
    
    # Themes
    theme_tree2() + 
    theme(text = element_text(size = 30),
          legend.position = "bottom", # Move our legend to the bottom
          legend.box = "vertical") +  # Stack legends vertically

    # Provide some labels
    labs(x="Time",
         title = "Subset of Canadian sequenced strains") +
    
    # Make our guide lines a little thicker 
    guides(colour = guide_legend(title = "Nextstrain\nclade",  
                                          override.aes = list(size=2)),
           shape = guide_legend(title = "Strain\nlocation")
          ) + 

    # 3. Scaling
    scale_shape_manual(values = c(15:20, 11)) +

    # 4. Geoms
    geom_tiplab(size = 7, align=TRUE) + # Add labels in and right-align them 

    # Add tips labels to our tree tips
    # Change our tip shape by strain_division
    geom_tippoint(aes(shape = strain_division), size = 3)  # Add tips only   

### 1.7.1 View the clade we made using the same MRCA call we already used.
viewClade(tree.plot, MRCA(tree.plot, to_view[1:5]))

1.8.0 Make a new subplot with information from across all clades¶

So we've looked at a few ways to subset and plot our phylogenetic tree. We'll do a quick aside to create a more curated list of tips with representation from across multiple clades. We'll use this as the base tree for our next few graphs so that we can really see the power of using additional information from our treedata object.

In [20]:
# Here's our representative list of tips from multiple clades

curated_tips <-
c('Wuhan/WH01/2019',
'Canada/ON_ON-VIDO-01-2/2020',
'Canada/ON_VIDO-01/2020',
'Canada/BC_37_0-2/2020',
'Canada/BC_69243/2020',
'Canada/QC-CHUM-2008003956A/2020',
'Canada/NL-NML-1387/2020',
'Canada/MB-NML-808/2020',
'Canada/NB-NML-16967/2021',
'Canada/BC_6129127/2020',
'Canada/AB-97776/2020',
'Canada/AB-12948/2020',
'Canada/BC-BCCDC-3564/2020',
'Canada/ON-S67/2020',
'Canada/AB-65233/2020',
'Canada/MB-NML-1057/2020',
'Canada/ON-UHTC-0366/2020',
'Canada/NS-NML-16218/2021',
'Canada/NB-NML-16966/2021',
'Canada/NS_13/2020',
'Canada/Qc-CHUM-2019203856A/2020',
'Canada/NB-NML-3272/2020',
'Canada/ON-UHTC-0267/2020',
'Canada/MB-NML-1037/2020',
'Canada/NS-NML-5213/2020',
'Canada/NS-NML-5141/2021',
'Canada/NS-NML-5140/2021',
'Canada/NS-NML-16208/2021',
'Canada/BC-BCCDC-7012/2020',
'Canada/ON-LTRI-1372/2020',
'Canada/BC-BCCDC-9736/2021',
'Canada/ON-S2125/2021',
'Canada/NS-NML-5181/2020',
'Canada/ON-PPS-21012021_0269/2021',
'Canada/QC-L00314539/2020',
'Canada/NS-NML-16238/2021',
'Canada/NL-NML-16822/2021',
'Canada/QC-L00324589001/2021',
'Canada/NL-NML-17194/2021')
In [21]:
# Make a new list of tips to drop
curated_drop_list <-
    SC2_metadata.df %>% 

    # filter the opposite of our tip labels
    filter(!(Strain %in% curated_tips)) %>% 
    # Only keep the strain information for what we want to drop
    pull(Strain)
In [22]:
# Drop tips from our tree and save it as curated_tree_info.df
curated_tree_info.df <-

    # Pass along our original tree
    SC2_variants_time.tree %>%

    # Drop the tips we created
    drop.tip(., curated_drop_list) %>% 

    # Convert to a tibble
    as_tibble() %>% 

    # Rename our column information
    rename(Clade = emerging_nextstrain, Province = strain_division) %>% 

    mutate(Age = replace_na(Age, replace = 0)) %>% 

    # Convert this to a data frame
    as.data.frame()

# Set the rownames from the label information
rownames(curated_tree_info.df) <- curated_tree_info.df$label

# Let's view the tree
str(curated_tree_info.df)
'data.frame':	72 obs. of  11 variables:
 $ parent        : int  40 42 42 44 44 45 45 46 46 48 ...
 $ node          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ branch.length : num  0.0757 0.0077 0.0077 0.1439 0.0729 ...
 $ label         : chr  "Wuhan/WH01/2019" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" "Canada/MB-NML-808/2020" ...
 $ GISAID        : chr  "L" "L" "L" "S" ...
 $ Clade         : chr  "19A" "19A" "19A" "19B" ...
 $ PANGO         : chr  "B" "B" "B" "A.1" ...
 $ strain_country: chr  "Asia" "Canada" "Canada" "Canada" ...
 $ Province      : chr  "Asia" "Ontario" "Ontario" "Manitoba" ...
 $ strain_date   : Date, format: "2019-12-26" "2020-01-23" ...
 $ Age           : num  44 0 0 0 27 59 0 43 35 53 ...

1.9.0 Convert your tree style with the layout parameter¶

So we've only been working with a rectangular tree (horizontal layout) thus far but as we mentioned at the beginning there are actually a number of different layouts we can use. Our goal now is to make a sort of sunburst plot now using a circular layout that we'll add visual categorical information to in a ring pattern.

First, we'll start with the circular version. We'll colour our tips by the province where the case or data was generated and label them by the Nextstrain clade information.

In [24]:
# Show our curated tree in a circular format

tree.plot <-

SC2_variants_time.tree %>%
drop.tip(., curated_drop_list)  %>% 

# 1. Data
ggtree(., layout = "circular", ### 1.9.0 Set the layout to circular
    mrsd = "2021-02-17") + # Need the most recent sampling date
    # 2. Aesthetics

    # Themes
    theme(text = element_text(size = 30),
          legend.position = "bottom") +
    labs(title = "Canada and US sequenced strain phylogeny",
         colour = "Province") +

    scale_colour_viridis_d(option = "plasma") + # This seems to literally kill the kernel!

    # 4. Geoms
    geom_tippoint(aes(colour = strain_division), size = 5) + 

    ### 1.9.0 set the tip labels
    geom_tiplab(aes(label = emerging_nextstrain), size = 8)

# View our tree
tree.plot

1.10.0 Use gheatmap() layer to add information to your plot¶

Now that we have our basic circular tree, we can use the information from curated_tree_info.df to visualize additional data on our tree. In order to use the gheatmap layer properly, we must pass it our original tree plot, along with a new dataframe (or vector) that holds the information we want to add. It should use the rownames from the data frame to help match up to the tips in our tree.

We'll drop the tip labels from our tree in favour of the heatmap we'll add. By adding layers of hierarchical data, this creates what is known as a Sunburst plot.

The gheatmap() layer takes some of the following parameters:

  • p: the tree plot we want to modify.
  • data: the data used to modify the tree plot p.
  • offset: determines the offset of the heatmap relative to the tree.
  • width: determines the total width of the heatmaps compared to the tree
  • colnames_angle: determines the angle of the text for the heatmap.
  • font.size: sets the font size for the heatmap portion of the tree.
In [26]:
# Show our curated tree in a circular format

tree.plot <-

SC2_variants_time.tree %>%
drop.tip(., curated_drop_list)  %>% 

# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
    # 2. Aesthetics
    
    # Themes
    theme(text = element_text(size = 30),
          legend.position = "bottom") +
    labs(title = "Canada and US sequenced strain phylogeny",
         colour = "Province") +

    scale_colour_viridis_d(option = "plasma") +

    # 4. Geoms
    geom_tippoint(aes(colour = strain_division), size = 5) # Add tips only 

### 1.10.0 Add a circular heatmap with gheatmap()
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length), 
                      offset = 0.1, width = 0.1,
                      colnames_angle = 90, font.size = 6, hjust = 1) + 
    scale_fill_continuous(name = "Tree coding") 


tree.plot
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

1.10.1 ggnewscale package allows multiple color/fill scales on your Sunburst plot¶

From our sunburst plot above we used multiple columns from curated_tree_info.df but the caveat is that the legend for each resulting column combined all of the data into a single new legend. That's helpful if the type of data could be continuous values like a [0,1] scale heatmap colour across different features. When dealing with multiple categories where the scale of the values varies, however, that doesn't work well for us. This also doesn't work well for categorical data.

Instead, we'd like to separate the colour/fill scales by repeatedly adding to our plot, layer by layer. As you might recall, however, we also run into the problem of scale_colour_* and scale_fill_* layers overwriting any previous layers.

To circumvent this reality, we'll use the ggnewscale package to generate new colour and fill scales with new_scale_fill() and new_scale_colour(). It can make the process slightly more encumbering but it will work out.

We'll add back in our original tip labels as well for this graph.

In [27]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=45, repr.plot.height=45)

# Show our curated tree in a circular format

# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))


# Save our first plot with just the tree
tree.plot <-

SC2_variants_time.tree %>%
drop.tip(., curated_drop_list)  %>% 

# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
    # 2. Aesthetics
    
    # Themes
    theme(text = element_text(size = 30),
          legend.position = "bottom") +
    labs(title = "Canada and US sequenced strain phylogeny") +

    # We'll want to ensure our colour guide ends up in the correct order
    scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +

    # 4. Geoms
    geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only 
    geom_tiplab(size = 10, align=TRUE, offset = 0.6) # Add in our labels


# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length), 
                      offset = 0.1, width = 0.1,
                      colnames_angle = 90, font.size = 6) + 
    # Set the name of our legend
    scale_fill_continuous(name = "Tree coding",
                          guide = guide_legend(order = 2)) 

### 1.10.1 use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()

# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), 
                      offset = 0.2, width = 0.1,
                      colnames_angle = 90, font.size = 12) + 
    
    # Set the name and order of our Clade legend
    scale_fill_discrete(name = "Nextstrain\nClade", 
                        guide = guide_legend(order=3))

### 1.10.1 use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()

# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO), 
                      offset = 0.35, width = 0.1,
                      colnames_angle = 90, font.size = 12) +     

    # Set the name and order of our PANGO legend
    scale_fill_manual(values = combo.colours, name = "PANGO\nlineage", 
                      guide=guide_legend(order=4))

# It's going to look stretched and weird in google colab
tree.plot
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

1.10.2 Create a vertical version of our plot using a rectangular layout¶

Depending on the level of data depth, you may also choose to layer yoru scales along the right-hand side of your tree. To accomplish this we can simply set our layout paramter to "rectangular". At the same time, we'll run into a couple of problems with how the ggtree is made.

One issue with ggplot is that as we add the scales, these are very much like annotations on our ggplot. The vertical (and horizontal) diplay areas of our plot are not updated as we add these annotations. When we add new colour scale layers to the plot, the labels will, in many cases, bleed outside of the designated plot area. To combat this, we can add a layer called vexpand() or hexpand() for horizontal expansion. Both of these take 2 parameters:

  • ratio: the ratio of expansion for your plot axis
  • direction: the direction you want the expansion in a range from 1 (left/top) to -1 (right/bottom).
In [29]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=30, repr.plot.height=35)

# Show our curated tree in a rectangular format

# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))


# Save our first plot with just the tree
tree.plot <-

SC2_variants_time.tree %>%
drop.tip(., curated_drop_list)  %>% 

# 1. Data
ggtree(., layout = "rectangular", ### 1.10.2 convert your plot to a rectangular format 
       mrsd = "2021-02-17") + # Need the most recent sampling date
    # 2. Aesthetics
    
    # Themes
    theme(text = element_text(size = 30),
          legend.position = "bottom",
         ) +
    labs(title = "Canada and US sequenced strain phylogeny",
         colour = "Province") +

    ### 1.10.2 Expand the vertical plot size
    vexpand(ratio = 0.1, direction = -1) + 

    # We'll want to ensure our colour guide ends up in the correct order
    scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +

    # 4. Geoms
    geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only 
    geom_tiplab(size = 10, align=TRUE) # Add in our labels 


# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length), 
                      offset = 0.75, width = 0.1,
                      colnames_angle = 90, font.size = 8, hjust = 1) + 
    # Set the name of our legend
    scale_fill_continuous(name = "Tree coding",
                          guide = guide_legend(order = 2)) 

# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill() 

# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), 
                      offset = 0.85, width = 0.1,
                      colnames_angle = 90, font.size = 12, hjust = 1) + 
    
    # Set the name and order of our Clade legend
    scale_fill_discrete(name = "Nextstrain\nClade", 
                        guide = guide_legend(order=3))

# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill() 

# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO), 
                      offset = 1.0, width = 0.1,
                      colnames_angle = 90, font.size = 12, hjust = 1) +     

    # Set the name and order of our PANGO legend
    scale_fill_manual(values = combo.colours, name = "PANGO\nlineage", 
                      guide=guide_legend(order=4))

# View the tree
tree.plot
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

1.11.0 Annotate connections between tips with geom_taxalink()¶

Suppose we want to join to tips/taxa together to suggest that there is a relationship between them or to emphasize some association between nodes. To connect nodes or tips of the tree, we can use the geom_taxalink() layer.

Let's join the nodes Canada/NL-NML-17194/2021 and Canada/ON-UHTC-0366/2020 together.

In [30]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=30, repr.plot.height=30)

# Show our curated tree in a rectangular format

# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))


# Save our first plot with just the tree
tree.plot <-

SC2_variants_time.tree %>%
drop.tip(., curated_drop_list)  %>% 

# 1. Data
ggtree(., layout = "rectangular", mrsd = "2021-02-17") + # Need the most recent sampling date
    # 2. Aesthetics
    
    # Themes
    theme(text = element_text(size = 30),
          legend.position = "bottom",
          ) +
    labs(title = "Canada and US sequenced strain phylogeny",
         colour = "Province") +
    vexpand(ratio = 0.1, direction = -1) + # Expand the vertical plot size

    # We'll want to ensure our colour guide ends up in the correct order
    scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +

    # 4. Geoms
    geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only 
    geom_tiplab(size = 10, align=TRUE) + # Add in our labels

    ### 1.11.0 Add in some annotation between points
    geom_taxalink(taxa1='Canada/NL-NML-17194/2021', taxa2='Canada/ON-UHTC-0366/2020', 
                  color="blue", alpha=0.8, offset=0, 
                  outward=FALSE,
                  arrow=arrow(length=unit(0.01, "npc")))


# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length), 
                      offset = 0.75, width = 0.1,
                      colnames_angle = 90, font.size = 8, hjust = 1) + 
    # Set the name of our legend
    scale_fill_continuous(name = "Tree coding",
                          guide = guide_legend(order = 2)) 

# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill() 

# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), 
                      offset = 0.85, width = 0.1,
                      colnames_angle = 0, font.size = 10) + 
    
    # Set the name and order of our Clade legend
    scale_fill_discrete(name = "Nextstrain\nClade", 
                        guide = guide_legend(order=3))

# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill() 

# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO), 
                      offset = 1.0, width = 0.1,
                      colnames_angle = 0, font.size = 10) +     

    # Set the name and order of our PANGO legend
    scale_fill_manual(values = combo.colours, name = "PANGO\nlineage", 
                      guide=guide_legend(order=4))

# View the tree
tree.plot
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

1.12.0 There is definitely more to explore from the ggtree package¶

So we've spent quite a bit of time looking at phylogenetic trees and how to add extrenal data to tips and nodes. We've barely scratched the surface and there are a lot of additional functions within the ggtree package that you can work with to build amazing plots including adding fasta sequence data through msaplot(). You can experiment with labeling internal nodes, and specific clades as well. You can also facet_plot() rectangular trees with other kinds of plots!

We didn't even have time to combine all sorts of plot data with our circular trees using the ggtreeExtra package. Definitely check out Chapter 10 of the tree book to be inspired by the potential things you could do with more complex datasets like this:

The geom_fruit() layer can be used to add extra visualization to your sunburst plot through the ggtreeExtra package

2.0.0 What are network diagrams and how can we use them?¶

We've seen a lot of trees today but a closely related structure is the network diagram. They share similar concepts in that both have nodes and branches. However, nodes are called vertices and can represent almost anything, branches between nodes are edges and can span across multiple nodes, instead of in a bifurcating tree relationship. Relationships between vertices can be bidirectional, and depending on what you'd like to present, multiple parallel edges may exist. We saw a specific version of network diagrams in Lecture 04 with our Sankey plots!

Network graphs are great for showing the interconnections between entities within your data. This kind of visualization can also help us realize where subtle connections occur (or don't occur!). Depending on how you've weighted or chosen edges, you can convey how strong relationships between entities are as well.

To work with graph data and plot it, we'll be using a couple of companion packages: tidygraph and ggraph. More about that later!

Since we already have some metadata about COVID genomes, let's see if we can't convert some of it to a network diagram to better understand strain information? We'll begin by selecting some information from our Nextstrain metadata set.

In [31]:
str(SC2_metadata.df, give.attr = FALSE)
spc_tbl_ [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Strain                    : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
 $ GISAID clade              : chr [1:6627] "L" "L" "L" "L" ...
 $ Nextstrain clade          : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Age                       : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ...
 $ Clade                     : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Country                   : chr [1:6627] "Asia" "USA" "Canada" "Canada" ...
 $ Admin Division            : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ...
 $ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ...
 $ gisaid_epi_isl            : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ...
 $ Host                      : chr [1:6627] "Human" "Human" "Human" "Human" ...
 $ Originating Lab           : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ...
 $ PANGO lineage             : chr [1:6627] "B" "B" "B" "B" ...
 $ Submission Date           : chr [1:6627] "Older" "Older" "Older" "Older" ...
 $ Region                    : chr [1:6627] "Asia" "North America" "North America" "North America" ...
 $ Region of exposure        : chr [1:6627] "Asia" "Asia" "North America" "Asia" ...
 $ Sex                       : chr [1:6627] "Male" "Male" "Male" "Male" ...
 $ strain                    : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
 $ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
 $ Submitting Lab            : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ...
 $ url                       : logi [1:6627] NA NA NA NA NA NA ...
 $ Collection Data           : Date[1:6627], format: "2019-12-26" "2020-01-29" ...
 $ Author                    : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ...
 $ Country of exposure       : chr [1:6627] NA "Asia" NA "Asia" ...
 $ Location                  : chr [1:6627] NA "Boston" NA NA ...

2.0.1 Perform a bit of data wrangling to select our data¶

Before we jump into making some network diagrams, we'll want to trim down the data from SC2_metadata.df. We don't need all of the data that is currently in this table. Instead we'll trim it down to 10 variables:

  • Strain: the strain name of the genome. This matches the tips of our original trees we made in section 1.0.0.
  • Country: the country where the case was reported.
  • Admin Division: the sub-region where the case was identified
  • Admin Division of exposure: the sub-region believed to be the source of the strain.
  • Region: appears to be the continent where the case was identified.
  • Region of exposure: the continent was the case was contracted.
  • Nextstrain clade: the Nextstrain clade for this strain
  • GISAID clade: the clade classifiction defined by the Global Initiative on Sharing Avian Influenza Data
  • PANGO lineage: the Phylogenetic Assignment of Named Global Outbreak lineage
  • Collection Data: the collection date of the strain (also misspelled)
In [32]:
SC2_graph_info.df <-

# Pass along our metadata
SC2_metadata.df %>% 

# We'll grab information about each case that relates to its geographical region
select(Strain, Country, 'Admin Division', 'Admin Division of exposure', 'Region of exposure', Region,
       'Nextstrain clade', 'GISAID clade', 'PANGO lineage', 'Collection Data') %>% 

# Do a bunch of renaming for our variables
rename(source_location = 'Admin Division of exposure', 
       case_location = 'Admin Division',
       source_region = 'Region of exposure',
       case_region = 'Region',
       collection_date = 'Collection Data',
       Nextstrain = 'Nextstrain clade',
       GISAID = 'GISAID clade',
       PANGO = 'PANGO lineage'
      )

head(SC2_graph_info.df)
A tibble: 6 × 10
StrainCountrycase_locationsource_locationsource_regioncase_regionNextstrainGISAIDPANGOcollection_date
<chr><chr><chr><chr><chr><chr><chr><chr><chr><date>
Wuhan/WH01/2019 Asia Asia Asia Asia Asia 19ALB2019-12-26
USA/MA1/2020 USA MassachusettsAsia Asia North America19ALB2020-01-29
Canada/ON_ON-VIDO-01-2/2020CanadaOntario OntarioNorth AmericaNorth America19ALB2020-01-23
Canada/ON_VIDO-01/2020 CanadaOntario Asia Asia North America19ALB2020-01-23
USA/CA-CDC-5/2020 USA California Asia Asia North America19ALB2020-01-29
Taiwan/CGMH-CGU-02/2020 Asia Asia Asia Asia Asia 19ASB2020-02-04

2.1.0 Use the tidygraph package to help prepare data¶

Now that we have some information that we want to work with, we want to convert that kind of data to something that can be interpreted into a graph. The tidygraph package provides a way to hook graph data into the tidyverse so that we can use common verbs and ideas to filter and work with it. Using this package we can convert our dataframe information into a tbl_graph object which is actually an igraph object.

The function we'll use to convert our data is as_tbl_graph() but it has some expectations about the data. The parameters of this function are:

  • x: the data frame we'd like to convert to an igraph.
  • nodes: a data frame with our node information.
  • edges: a data frame of two columns containing integers matching node information to describe the relationship between nodes.

If you have both a nodes and edges data frame you can certainly generate a table this way. However, we have a complex dataframe and we want to track all of the information in it. To that end we will add columns from and to based on variables that already exist and the graph nodes will be generated from this information. When we provide the data frame, the function will recognize the columns present and produce node-based data and generate edge characteristics from our other columns.

In [33]:
# Now we'll add some specific variables used to generate our graph table

SC2_graph_info.df %>% 

    # Create our "from" and "to" columns in our data frame
    mutate(from = source_location,
           to = Country
          ) %>%  

    # Filter for just strain data from Canada and Asia
    filter(Country %in% c("Canada", "Asia")) %>% 

    # Convert to an igraph
    as_tbl_graph() %>% 

    # Look at the structure of the igraph
    str()
List of 14
 $ :List of 1
  ..$ Asia: 'igraph.vs' Named int [1:1219] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:1219] "Asia" "Asia" "Asia" "Asia" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Ontario: 'igraph.vs' Named int [1:189] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:189] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Nova Scotia: 'igraph.vs' Named int [1:150] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:150] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ British Columbia: 'igraph.vs' Named int [1:143] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:143] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Quebec: 'igraph.vs' Named int [1:170] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:170] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ New Brunswick: 'igraph.vs' Named int [1:79] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:79] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Manitoba: 'igraph.vs' Named int [1:148] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:148] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Newfoundland and Labrador: 'igraph.vs' Named int [1:161] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:161] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Alberta: 'igraph.vs' Named int [1:56] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:56] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Canada: 'igraph.vs' Named int [1:12] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "names")= chr [1:12] "Canada" "Canada" "Canada" "Canada" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Europe: 'igraph.vs' Named int [1:57] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:57] "Asia" "Asia" "Asia" "Asia" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ North America: 'igraph.vs' Named int [1:15] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:15] "Asia" "Asia" "Asia" "Asia" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ Africa: 'igraph.vs' Named int [1:5] 1 1 1 1 1
  .. ..- attr(*, "names")= chr [1:5] "Asia" "Asia" "Asia" "Asia" ...
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 $ :List of 1
  ..$ South America: 'igraph.vs' Named int [1:3] 1 1 1
  .. ..- attr(*, "names")= chr [1:3] "Asia" "Asia" "Asia"
  .. ..- attr(*, "env")=<weakref> 
  .. ..- attr(*, "graph")= chr "0cd1671a-de00-11ed-a4f2-f14218d91ea9"
 - attr(*, "class")= chr [1:2] "tbl_graph" "igraph"
 - attr(*, "active")= chr "nodes"

2.2.0 Plot your graph information with ggraph()¶

The ggraph() function is one of many from a package of the same name. This package adds geoms and architecture that is compatible with ggplot2 (for the most part). Some of the functions we are interested in are:

Component Function Description
Graph ggraph The equivalent of the call to ggplot, it sets up the basic information about the graph plot
Edge geom_edge_link Produces edges between points
Edge geom_edge_fan Produces edges between points but accounts for parallel edges, creating arcs that fan out
Edge geom_edge_parallel Produces multiple edges between points with parallel lines representing parallel edges
Edge geom_edge_loop Used to represent edges that start and end at the same node. Does not account for parallel edges
Node geom_node_point Add basic nodes to your graph
Node geom_node_circle Add nodes to your graph that can be scaled by the coordinate system

The ggraph() function takes in 3 parameters:

  • graph: the igraph object that we'll base the graph on.
  • layout: the type of layout for the graph. There are many options including: auto, dendrogram, linear, matrix, treemap (yep!) circlepack, partition, and hive.
  • circular: a logical stating whether the layout should be transformed into a radial representation. It can't be applied to all layouts (think polar_coord).
In [34]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=12)

# Now we'll add some specific variables used to generate our graph table

SC2_graph_info.df %>% 

    # Generate our from and to edge information
    mutate(from = source_region,
          to = case_region
          ) %>%  

    # Convert to an igraph and pass it on to be plotted
    as_tbl_graph() %>% 

    # 1. Data
    ggraph(.) + 
        # 2. Aesthetics
        theme(text = element_text(size = 20)) +

        ### 2.2.0 Add our edges
        geom_edge_link(aes(colour = Nextstrain), arrow=arrow(), width = 1) +

        ### 2.2.0 Add our nodes and label them
        geom_node_point(size = 7) + 
        geom_node_text(aes(label = name), size = 7, repel = TRUE)
Using `stress` as default layout

Warning message:
"Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
i Please use `linewidth` in the `default_aes` field and elsewhere instead."

2.2.1 Use geom_edge_fan() to visualize parallel edges¶

So from our graph we can see some characteristics about how our case data is connected. Although most of the data is centred around genomes sequenced in North America so we can see that "North America" is also a source region.

The way we see the edges, however also has all of them overlaying on top of each other. We know that this isn't going to be a 1:1 relationship and although we have coloured the edges, we only see the topmost edge in a stack of many. For a graph like this, if we want to see all of the edges, we can use the geom_edge_fan() layer which will help us see all of our parallel edges in all of their splendour. Of note, we can also use geom_edge_parallel() as a layer. While this layer may make the number of connections clearer, it can get quite crowded.

Let's try it out.

In [35]:
# Now we'll add some specific variables used to generate our graph table

SC2_graph_info.df %>% 

    # Generate our from and to edge information
    mutate(from = source_region,
          to = case_region
          ) %>%  

    # Convert to an igraph
    as_tbl_graph() %>% 

    # 1. Data

    ggraph(.) + 
        # 2. Aesthetics
        theme(text = element_text(size = 20)) +

        ### 2.2.1 Add our edges
        geom_edge_fan0(aes(colour = Nextstrain), arrow=arrow()) +

        # Add our nodes and label them
        geom_node_point(size = 7) + 
        geom_node_text(aes(label = name), size = 7, repel = TRUE)
Using `stress` as default layout


2.2.2 Filter and choose your data carefully to better visualize it¶

The metadata we've chosen isn't very detailed and that can be the case for many datasets so you'll want to be sure of how you pick your nodes. Right now we are picking our from and to values based on the supposed source and identifying continents of the strain.

While other variables could offer more information, they can vary in consistency from a region (Asia) to a province (Ontario) as seen in variable case_location. Definitely aim to be consistent when you're working with your data otherwise the nodes may have less meaning.

Let's try looking at source_region (Continent) to case_location (Divisions) as it relates to data from Canada and Asia

In [36]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)

# Now we'll add some specific variables used to generate our graph table

SC2_graph_info.df %>% 

    # Generate our from and to edge information
    mutate(from = source_region,
          to = case_location
          ) %>%  

    # Filter for Canada data
    filter(Country %in% c("Canada", "Asia")) %>% 

    # Convert to an igraph
    as_tbl_graph() %>% 

    # 1. Data
    ggraph(.) + 
        # 2. Aesthetics
        theme(text = element_text(size = 20)) +

        # Add our edges
        geom_edge_fan(aes(colour = Nextstrain), arrow=arrow()) +
        ### 2.2.2 If we suspect looping edges, we need to define that layer specifically
        geom_edge_loop() +

        # Add our nodes and label them
        geom_node_point(size = 7) + 
        geom_node_text(aes(label = name), size = 7, repel = TRUE)
Using `stress` as default layout


2.3.0 There are many more visualizations from ggraph¶

Interestingly in the first year of the pandemic, from the sequenced genome data, infections within Canada appear to originate mostly from within North America but there are some infections that originated from Asia, and entered through Ontario and British Columbia - our two main global ports of entry!

We've really only covered linear graph layouts in our data but there are actually many kinds of graphs that this package can produce. The layout parameter is the key to exploring all of the graph types available. Of course your data layout all has to make sense! You'll find great examples from data-imaginist.com like this circlepack graph:

There are allo kinds of network graphs that can use additional layers of metadata to shape and present your data

3.0.0 Sequence motifs and genomic markers¶

When working with data sometimes you may be working with sequence-specific data for analysis of motifs or you may have a large set of data describing genomic markers like single nucleotide variants. Two popular visualizations in this domain are the sequence logo and Manhattan plot.

3.1.0 Generate sequence logos with ggseqlogo¶

This one is short and simple. If you have a series of sequences you'd like to plot based on frequency of each bases you see at each position, the sequence logo is for you. The ggseqlogo package offers a simple ggplot2-compatible set of geoms you can add to your graph in order to represent this data.

Your data can come in two flavours:

  1. A named list where each index position is a vector of sequences, and the names of each index can have meaning (like a transcription factor).
  2. A frequency matrix where columns correspond to positions, and rownames correspond to bases (or amino acids). Each entry is the number of times that a base is seen at a specific position. These matrices could be daisy-chained in a list.

Let's work with the former since it does have some flexibility and looks like something we recognize.

In [37]:
# Build an example sequence set (it's actually from the SARS-CoV-2 spike sequence )
example_seq1 <- c("aacccactaatggtgttggtt","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
                 "aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt",
                 "aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt") %>% 

    # Convert it all to upper case because ggseqlogo appears to hate when we don't
    str_to_upper()

example_seq2 <- c("cacccactaatggtgttggtg","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
                 "cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt",
                 "cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt") %>% 

    # Convert it all to upper case because ggseqlogo appears to hate when we don't
    str_to_upper()

# Build our list
example_seq.list = list(spike_set1 = example_seq1, spike_set2 = example_seq2)

# Take a look at our list
str(example_seq.list)
List of 2
 $ spike_set1: chr [1:9] "AACCCACTAATGGTGTTGGTT" "AATCCACTAATGGTGTTGCTT" "AACCCAATAATAGTGTTGGTT" "AACCCACTAATGGTGTTGGTT" ...
 $ spike_set2: chr [1:9] "CACCCACTAATGGTGTTGGTG" "AATCCACTAATGGTGTTGCTT" "AACCCAATAATAGTGTTGGTT" "CACCCACTAATGGTGTTGGTG" ...

3.1.1 Plot your logo information with the ggseqlogo() wrapper function¶

We have two options to look at our sequence data. The ggseqlogo() is a wrapper that sets up the entire plot for us including if you'd like to facet the data. If you just want a single logo then you pass parts of the list individually.

If you'd like multiple logos, you can decided their layout with the ncol or nrow parameters. The other parameters are:

  • facet which will accept either "wrap" or "grid".
  • scales which will accept the scale types for the facet geom.
In [38]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=4)

# Make a 2-row sequence logo
ggseqlogo(example_seq.list, nrow=2) +
    theme(text = element_text(size = 20)) +
    labs(title = "Facet in rows")

# Make a 2-column sequence logo
ggseqlogo(example_seq.list, ncol=2) +
    theme(text = element_text(size = 20)) +
    labs(title = "Facet in columns")
Warning message:
"The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
i The deprecated feature was likely used in the ggseqlogo package.
  Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>."

3.1.2 Plot your logo information with geom_logo()¶

To plot our data we can use ggplot language that we are familiar with and add a layer to our plot with the geom_logo() function. This layer is also supplied by the ggseqlogo package. In order for this to function properly, however, we need to define our parameters as:

  • data: the vector with of our sequence data
  • seq_type: sets the type of sequence like "DNA", "RNA" or "AA"
  • method: the y-scale of our logo either in "bits" (information value) or by "probability"

These and other parameters can also be set in the ggseqlogo() function too.

In [39]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=4)

# 1. Data
ggplot() + 
    # 2. Aesthetics
    theme_logo() + 
    theme(text = element_text(size = 20)) +
    labs(title = "Facet in columns with geom_logo()") +

    # 4. Geoms
    geom_logo(data = example_seq.list, seq_type="DNA", method = "bits") + 
    
    # 6. Facet
    facet_wrap(~ seq_group, ncol=2)

3.2.0 The Manhattan plot for all your genome-position visualization needs¶

The Manhattan plot is usually used in visualizing data across the genome from LOD scores to marker proportions.

Named for it's visual similarity to the Manhattan skyline of singular towering skyscrapers scattered above low-level buildings, this plot is commonly used for visualizing genomic markers across an ordered linear axis like a series of chromosomes. The y-axis of the values can represent LOD scores or proportions of marker representation. This is frequently used to visualize GWAS or mapping data.

This is another scatterplot that we can generate directly in ggplot with some effort/setup or we can use a package like qqman.

First let's look at the kind of data. Fun fact! You can read in archived/zipped data as well, although be careful, GWAS files can be quite large!

In [40]:
load("data/Lecture06.RData")

str(GWAS_data.df, give.attr = FALSE)
spc_tbl_ [527,210 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ CHR           : num [1:527210] 23 23 23 23 23 23 23 23 23 23 ...
 $ POS           : num [1:527210] 2704803 2711723 2723550 2788926 2800788 ...
 $ rsid          : chr [1:527210] "rs774167872" "rs311171" "rs7889852" "rs750465443" ...
 $ SNPID         : chr [1:527210] "X:2704803_T_C" "X:2711723_G_A" "X:2723550_C_T" "X:2788926_A_C" ...
 $ Allele1       : chr [1:527210] "T" "G" "C" "A" ...
 $ Allele2       : chr [1:527210] "C" "A" "T" "C" ...
 $ AF_Allele2    : num [1:527210] 0.0129 0.0116 0.0115 0.0148 0.0147 ...
 $ imputationInfo: num [1:527210] 1 0.589 0.858 0.395 0.784 ...
 $ N             : num [1:527210] 135 135 135 135 135 135 135 135 135 135 ...
 $ BETA          : num [1:527210] -0.235 -1.163 -0.993 -0.902 -1.745 ...
 $ SE            : num [1:527210] 1.218 1.554 1.48 0.993 2.225 ...
 $ p.value       : num [1:527210] 0.847 0.454 0.502 0.363 0.433 ...

3.2.1 Use the manhattan() function to generate a plot¶

To generate our Manhattan plot we can use the manhattan() wrapper function which will require four sets of parameter information:

  • chr: the column name with the chromosome information for a given SNP.
  • bp: the basepair position of the SNP.
  • snp: the ID of the SNP usually a RefSNP ID (RSID) of some kind.
  • p: the p-value for that particular SNP generated from the case vs. control analysis. This will be converted to a $-log10(p)$ value for the y-axis.
In [41]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
# manhattan(gwasResults, chr="CHR", bp="BP", snp="SNP", p="P")

GWAS_data.df %>% 
# You can filter your data if you want this step to run faster
# filter(CHR %in% c(1:5)) %>%  

manhattan(., chr="CHR", bp="POS", snp="rsid", p="p.value")

3.2.2 Use the highlight parameter to identify markers of interest¶

Now that we've plotted our Manhattan plot, we can see some horizontal lines corresponding to minimum p-values of 1.0 x 10-5 and 5.0 x 10-8 although the significance of these can depend on sample size and allele frequencies of your SNPs.

Either way, you can highlight SNP results based on a provided list. In this case, we can generate a list ourselves by filtering on the p.value variable.

In [ ]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)

snp_candidates <- 

    # Pass along our GWAS data
    GWAS_data.df %>% 

    # Filter it for low p-values
    filter(-log10(p.value) >= 5) %>% 

    # Select SNP candidates
    pull(rsid) %>% 
    unique()

# snp_candidates

# Plot our Manhattan plot
manhattan(GWAS_data.df, 
          chr="CHR", 
          bp="POS", 
          snp="rsid", 
          p="p.value", 
          highlight = snp_candidates)

4.0.0 Class Summary¶

With this final lecture we've covered the spectrum of visualizations covering the most basic of scatter- and barplots, graduating to boxplots, violin plots, and their variants. We've covered high-throughput data visualizations including volcano plots, heatmaps, and principal component analysis. Furthermore we looked at how simple the projection of high-dimension data can be with t-SNE and UMAP.

We finished our course today covering phylogenetic trees, network graphs, and some sequence analysis visualizations. Overall you now have the tools to wrangle data that may appear in all sorts of formats along with a better understanding of when and how to prepare some of the most common data visualizations in your burgeoning science careers.

Congratulations and good luck on your data science journey!


4.0.1 Post-course survey¶

We have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!

Anonymous Google Survey found here


4.1.0 Weekly assignment¶

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown
Code 50% - Does it follow best practices?
- Does it make good use of available packages?
- Was data prepared properly
Answers and Output 50% - Is output based on the correct dataset?
- Are groupings appropriate
- Are correct titles/axes/legends correct?
- Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.

You can save and download the Jupyter notebook in its native format. Submit this file to the the appropriate assignment section by 9:59am on April 25th, 2023.


4.2.0 Acknowledgements¶

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


4.3.0 References¶

The book of ggtree from the Yu Lab: https://yulab-smu.top/treedata-book/index.html

Chapter 10 of ggtree with amazing and complicated plots: https://yulab-smu.top/treedata-book/chapter10.html

More information on the ggraph package: https://cran.r-project.org/web/packages/ggraph/ggraph.pdf

ggseqlogo package information: https://cran.r-project.org/web/packages/ggseqlogo/ggseqlogo.pdf

Manhattan plot tutorial: https://www.r-graph-gallery.com/101_Manhattan_plot.html

GWAS p-value threshold: https://www.nature.com/articles/s41380-020-0670-3?proof=t

More on GWAS thresholds for significance: https://www.nature.com/articles/ejhg2015269.pdf


The Center for the Analysis of Genome Evolution and Function (CAGEF)¶

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.